# **BMP Image Edge Detector**

# **Design Review**

Submission Date: 03/29/2018 Thursday 14:30 Lab (Lab Section 11) TA: Reena Elangovan

# Prepared by:

Ni Kang Haoming Zhang Haobo Chen

#### 1. Executive Summary

Edge detection is a basic problem for image processing and computer visualization, whose aim is to denote the points which change obviously in digital image. The obvious changes in image properties usually reflect the property's important events and changes, which includes: discontinuity in depth, discontinuity on surface, the changes in matter properties and the brightness change in a specific scenario. Image edge detection drastically decreases the number of data and eliminates the irrelevant information which keeps the important properties of the image structure. There are a lot of ways to be used in edge detection, and they can be divided into two parts. The first part is to find the maximum and minimum value in the first derivation of the image. The second part is to find the edge through find the second derivation of the image. This result in a large amount of arithmetic and repeated processing, which, makes an ASIC design a perfect fit for the purpose. In addition, having a SoC design produced specifically for them function described above will drastically diminish the workload for processing images.

The design will utilize AHB Lite Interface, which is capable for both reading and writing. The AHB Lite is required for communication between systems in the design. Since edge detection requires large scale computation, AHB Lite itself won't be able to store all necessary data. All data will be stored in registers in order for edge detection. The edge detector will also have two dedicated core, one for image filtering, the other is for edge detection.

The successful design of the proposed edge detector will require the following resources:

- AHB Lite Datasheet
- Reference Standard Cell Simulation Library for Mapped Design Verification
- Reference Standard Cell Technology Library for Final Design Layout Verification
- Verilog HDL Simulation and Design Synthesis Tool Chain

The following documents will describe:

- Super high level block diagram of edge detection
- Edge detector address mapping
- Design pinout
- Design architecture

## 2. Design Specifications

### 2.1. System Usage

#### 2.1.1. System Usage Diagram



Figure 1: Example System Usage Diagram for Edge Detector

An example system illustrating the intended use of this edge detector is depicted above in Figure 1. In this system there is a main processor where any relevant software would be executed, DRAM where any intermediate values are stored in and edge detector for handling large scale calculation. The key operational ideas are that the software running on the processor would perform the following steps:

- 1. Configure the operational settings directly on the edge detector
- 3. Directly send the image data to the edge detector
- 4. Wait until edge detector's completion status bit is asserted
- 5. Output the calculated result to the data bus

#### 2.1.2. Implemented Standard(s) and Algorithm(s)

#### **Edge Detector**

Slave whose input is controlled by the processor

Sublaminar master that controls read and writing data to the DRAM

#### **AHB Lite Standard Slave**

256-bit data bus

Read and Write Transfers

**Burst Transfers Supported** 

**Pipelined Transfers** 

3-bit, 8 word-address namespace (0x000  $\rightarrow$  0x007)

# 2.1.3. Edge Detector Namespace Address Mapping

| Slave Word | Address<br>Read / Write | Data<br>Size<br>(Byte) | Description                                                                            |
|------------|-------------------------|------------------------|----------------------------------------------------------------------------------------|
| 0x000      | R/W                     | 4                      | Image width                                                                            |
| 0x001      | R/W                     | 4                      | Image height                                                                           |
| 0x002      | R/W                     | 4                      | Image header size                                                                      |
| 0x003      | R/W                     | 1                      | Config bit for completion of reading image width 1 -> complete 0 -> not complete       |
| 0x004      | R/W                     | 1                      | Config bit for completion of reading image height 1 -> complete 0 -> not complete      |
| 0x005      | R/W                     | 1                      | Config bit for completion of reading image header size 1 -> complete 0 -> not complete |
| 0x006      | R/W                     | 1                      | Config bit indicating that the process is ready to start 1 -> ready 0 -> not ready     |
| 0x007      | R                       | 4                      | Input Image Data Starting Location 1                                                   |

Table 1: Namespace Address Mapping Table

# 2.2. Design Pinout

| Signal Name | Direction | Number of bits | Description                         |
|-------------|-----------|----------------|-------------------------------------|
| vec         | PWR       | /              | Power Pin                           |
| gnd         | GND       | /              | Ground Pin                          |
| clk         | IN        | 1              | System clock(100MHz)                |
| n_rst       | IN        | 1              | Asynchronous Reset.<br>(Active Low) |

Table 2: Miscellaneous Pinout Table

| Signal Name  | Direction | Number of bits | Description                                                                                       |
|--------------|-----------|----------------|---------------------------------------------------------------------------------------------------|
| image_data   | IN        | 24             | Single pixel image input                                                                          |
| filter       | IN        | 18             | Sobel filter value                                                                                |
| process_done | OUT       | 1              | Set to high when process finished                                                                 |
| image_out    | OUT       | 24             | Single pixel image output                                                                         |
| data_ready   | IN        | 1              | Set to high when all configuration information is ready so that the data is ready to be processed |

Table 3: AHB Lite Interface Pins

#### 2.3. Operational Characteristics

#### 2.3.1. Image Data Input

Before data processing, all the image data will be stored in a buffer in the data bus as a two-dimensional array. The configuration data of the image will be read into the module. The configuration flags won't be set until the preparation procedure is complete. After all the configuration flags are asserted, the data\_ready flag is asserted. Only after then, the actual data processing can start. The procedure will be done in pipelined transfers such that the clock cycles needed for waiting is minimized.

#### 2.3.2. Window Buffer Initialization

Each pixel is represented by 3 bytes. Since by default AHB Lite bus process a word (4 bytes) at a time, the efficient way to read data from the data bus is to read a chunk of 12 bytes (4 pixels) at a time so that there is no redundant byte. In order to minimum the number of data reading, it is necessary to construct a 8x8 window buffer so that after processing each pixel, only the replacement of single row or column is needed for further processing.

During the initialization, the window buffer will load the first 8x8 pixel data at the upper right corner of the image. The reason for loading in this amount of data is articulated below.

• For any given pixel in the image, it requires the information from 8 pixels around it for applying the kernels that will be explained in the next couple sections. Thus, to process a 3x3 pixel group, it requires data from a 5x5 set including their surrounding pixels. However, as mentioned above, the easiest way to read data from AHD Lite is to read 4 pixels at a time. Taking efficiency into account, it is reasonable to take 2 sets of 4 pixels for the initialization.

#### 2.3.3. Window Buffer Iteration

For any further iterations, if the pixel of interest moves to the left or the right and doesn't exceed the current buffer window, the data don't need any update. When the pixel of interest and its surrounding pixels exceeds the side bounds of the buffer window. It is only necessary to shift the existing columns of values to the left by 4 bytes and load in a 4x8 set of data. If the pixel of interest moves downward, it behaves in a similar way as explained above.

Instead of starting at the beginning of each row, in order to minimize the amount of data updating, the pixel of interest will move in a zigzagging pattern such that only 32 pixels need to be updated each time except the initialization.

In the corner cases where the remaining of a row is not enough for 4 pixels or the remaining rows is less than 4, The data shift process will react accordingly.

#### 2.3.4. Gaussian Blur

For each pixel of interest, Gaussian blur will apply the following kernel to it.



Figure 2. Gaussian Blur Kernel

In such a way, the image gets blurred and only the major edges can be detected by the Sobel filter. Thus, the quality of the edge detection can be improved.

The blurred data will be sent to another window buffer for further processing.

#### 2.3.5. Window Buffer for Blurred Data

The Gaussian Blur module will produce a 6x6 matrix to the second window buffer, this window buffer operate in the same way as the previous one.

#### 2.3.5. Sobel Filter Operation

With the window buffer produced from Gaussian blue. The Sobel filter module apply the following filter kernels on each pixel of the image.



Figure 3. Sobel Filter Kernel

By applying the filter kernel in both horizontally and vertically, we will be able to find the gradient relation between the pixel of interest and its surrounding pixels in both directions.

#### 2.3.6. Gradient Computing Operation

After a group of gradients is found, the Pythagorean theorem is applied to the two gradient vectors. The calculated edge value is only necessary for comparing with the threshold, leave it as a squared value is acceptable.

$$edge^2 = \Delta x^2 + \Delta y^2$$

If the edge value is larger than a threshold of our choice, it is recognized as a valid edge and will output FF (i.e. white pixel). Otherwise it will output 00 (i.e. black pixel).

By performing the operation above, an image with only major edges will be generated.

#### 2.5. Requirements for Design

The current cell phone camera provides an image quality of about 8 megapixels. With given clock speed restriction, it is expected that the edge detection would finish below one second. With a clock speed of at least 200 MHz, it is a reasonable aim to complete the image processing in less than 1 second. This requires pipeline transfer of data and multitasking. Meaning that all tasks are done parallelly so that ideally no clock cycles are wasted for waiting for new data coming in. The pipelining will be handled by the state machine in the controller module to the greatest extend. The method to buffer 8x8 data matrix is selected so that the less data reading is performed such that no spaces are wasted. Moreover, the zigzagging pattern that the pixel of interest moves forward will also help to reduce the number of data reading.

In terms of area, since the image processing requires a large amount of registers to store intermediate values. Given the design described above, the number of register used will be approximately ##. The detailed distribution will be described in the budgeting session below.

When implementing the edge detector, the choice of threshold number is critical. The actual choice will be tested during the experiment. Applying the different thresholds of the design will affect the final output rate. If a large threshold is chosen, the output will be vague (the edges of the picture will not be shown clearly, there might be missing parts if the color gradient is not large enough). Otherwise, if the threshold is low, the output will be messy (the edge will include redundant edges, which will not perfectly outline the image). Thus, we have to find the perfect threshold of the bmp edge detector through different tests, which is also the first requirement of our design. Secondly, knowing the size of the output image is also critical. We have a one thousand samples flag to measure how many bytes were passed into and processed by the system. This flag will not only be used in the purpose of indicating how many bytes are passed in but will also be implemented for debugging purpose. This is the second main requirement for our design.

# 2.5.1 Area Budget

| Core Area Calculations      |         |         |              |                                                    |
|-----------------------------|---------|---------|--------------|----------------------------------------------------|
|                             |         | Gate    |              |                                                    |
| Name of                     | Categ   | /FF     | Area         | Comments                                           |
| Block                       | ory     | Cou     | (um2)        | Comments                                           |
|                             |         | nt      |              |                                                    |
|                             | Reg. w/ |         |              |                                                    |
| WindowBuffer1               | Reset   | 1536    | 3,686,400    | 8x8x24x1600x1.5                                    |
|                             | Reg. w/ |         |              |                                                    |
| WindowBuffer2               | Reset   | 864     | 2,073,600    | 6x6x24x1600x1.5                                    |
|                             | Combina |         |              | 8*24*2(adder)+3*4(counter)+(2+3)*10*3(divide       |
| Gaussian Blur               | tional  | 446     | 334,500      | r)                                                 |
|                             | Combina |         |              | 2 * 3*(8 bit substractor, 8 FA + 10 XOR gates) for |
| Sobel Filter                | tional  | 2496    | 1,872,000    | X , Y calculation, totally 16 pixels               |
|                             | Combina |         |              |                                                    |
| <b>Gradient Calculation</b> | tional  | 384     | 288,000      | x^2+y^2                                            |
| Total Core Area             |         |         | 8,254,500    |                                                    |
|                             | Ch      | ip Area | Calculations | (units in um or um2)                               |
| Number of I/O Pads:         |         | 12      |              |                                                    |
| I/O Pad Dimensions:         | 90      | by      | 300          |                                                    |
| I/O Based Padframe          |         | la v    |              |                                                    |
| Dimensions:                 | 870     | by      | 870          |                                                    |
| Core Dimensions             | 2873    | by      | 2873         |                                                    |
| Core Based                  |         |         |              |                                                    |
| Padframe                    |         | by      |              |                                                    |
| Dimensions:                 | 3773    |         | 3773         |                                                    |
| Final Padframe              |         | hv      |              |                                                    |
| Dimensions:                 | 3773    | by      | 3773         |                                                    |
| Final Chip Area:            | 1       | 4,236,0 | 16           |                                                    |

# 2.5.2 Timing Budget

| Starting<br>Component | Propaga<br>tion<br>Delay<br>(ns) | Combinatio<br>nal Logic | Propaga<br>tion<br>Delay<br>(ns) | Ending<br>Component | Setup Time or<br>Propagation<br>Delay (ns) | Total Path Delay (ns) | Target<br>Clock<br>Period<br>(ns) |
|-----------------------|----------------------------------|-------------------------|----------------------------------|---------------------|--------------------------------------------|-----------------------|-----------------------------------|
|                       |                                  |                         |                                  | WindowBuffer        |                                            |                       |                                   |
| Gaussian Blur         | 0                                | Gaussian Blur           | 89.2                             | 2                   | 0.8                                        | 90                    | 5                                 |
|                       |                                  |                         |                                  | Gradiant            |                                            |                       |                                   |
| Sobel Filter          | 0                                | Sobel Filter            | 15.6                             | Calculation         | 0                                          | 15.6                  | 5                                 |
| Gradiant              |                                  | Gradiant                |                                  |                     |                                            |                       |                                   |
| Calculation           | 0                                | Calculation             | 12.8                             | -                   | 0                                          | 12.8                  | 5                                 |
| WindowBuffer          |                                  |                         |                                  |                     |                                            |                       |                                   |
| 1                     | 0                                | -                       | 0                                | Gaussian Blur       | 0                                          | 0                     | 5                                 |
| WindowBuffer          |                                  |                         |                                  |                     |                                            |                       |                                   |
| 2                     | 0                                | -                       | 0                                | Sobel Filter        | 0                                          | 0                     | 5                                 |

#### 3. Design Implementation

#### 3.1. Design Architecture



Figure 4: Edge Detection Architecture Diagram

The intended implementation architecture is depicted above in Figure 2. A module handling the implementation of the AHB Lite interface is used to bridge between the external SoC environment and the internal data and status locations. The control unit will be used to control the pipeline transfer of the design. There will be two window buffers for storing the intermediate processing data, one is for storing the original data, the other is for storing the preprocessed data coming from the Gaussian Blur module. The Sobel Filter and Gradient module will serve all the core functions for the edge detector. There are no other storage devices except for the intermediate processed data. The algorithm will be controlled by a Finite-State-Machine with state counters for tracking the progression through the needed stages and handle the data transfer between modules with pipeline transfer feature.



Figure 5: Edge Detection Timing Diagram

The timing diagram shows the internal relationship between each blocks. The read enable signal on the original buffer going high indicates that the buffer is loading. While the read enable goes to 0, it means that shifting is done and the shift\_done flag gives a strobe to the next block, which is Gaussian block, to make it start doing the corresponding work. While the Gaussian block finishes the shifting and calculation, it sets the shift\_done signal to high and stores the value into the blurred buffer. When all data from Gaussian block was transmitted, the shift\_done in the blurred buffer sets to high to make the sobel block start working. Finally, when the sobel block finishes working, it shifts out the final output to the Gradient block, which is a purely combinational block.

### 3.2 Functional Block Diagram



Figure 6: Window Buffer Module Block Diagram



Figure 7: Window Buffer Module Timing Diagram

The window buffer will be utilized to store the original and blurred data. But the implementation is identical. First the dir\_sel flag will be used to decide the behavior of the shifting operation. If dir\_sel is 11, it means that this is the first time the module gets executed. Otherwise, it indicate the direction of shifting operation. If r\_enable is asserted, the image data will be read in serially and be stored in the empty parts of the shifted buffer. A combination of new data and old data will form the next buffer\_data to be calculated. The done\_flag is asserted only when the internal counter reaches it's rollover flag.

#### **Gaussian Blur Module**



The Gaussian Blur module will have two inputs from MCU: the 'read\_enable' signal to enable the counter (starts from 1) and the  $x_i$ ,  $y_i$  values to determine the center pixel for calculation. The 'calc\_enable' signal is to get to know if the calculation can be processed. If it is set, the calculated value will be shifted into the next window buffer. After reading the 3 \* 3 pixel data from the previous buffer, a Gaussian Blur calculation will be implemented by adding the total 9 numbers together and then divide the sum by 9 to get the average pixel value.  $X_i = x_i + 1$ ,  $x_i = x_i + 1$ . These two determines the

coordination on next window buffer. After calculation, the pixel will be transmitted to the next window buffer. And finally, when the  $x_0$  or  $y_0$  value reaches to the bottom-left index, return a gaussian\_done signal to MCU.

#### **Sobel Filter Module**

This is a purely combinational block, which will calculate two values for each pixel. The input of this module is a 6\*6 pixel array, and the output should be two arrays calculated by the middle 4\*4 pixels. One of the output arrays should be the horizontally calculated value of the 4 pixels, and the other one is the vertically calculated value of the 4 pixels. And the diagrams of the specific algorithm are shown above.

#### 4.1. Project Timeline and Division of Tasks

- Week of Apr 2nd: Write code of window buffer and testbench
- a. Code: Haobo Chen, Ni Kang
- b. Testbench: Haoming Zhang
- Week of Apr 9th: Write code of gaussian blur and testbench
- a. Code: Haoming Zhang
- b. Testbench: Haobo Chen, Ni Kang
- Week of Apr 16th: Write code of Sobel and Gradient and testbench
- a. Code: Haobo Chen, Ni Kang
- b. Testbench: Haoming Zhang
- Week of Apr 23rd: Write the final report and prepare the presentation

#### 4.2. Success Criteria

#### 4.2.1 Fixed Criteria: (Total of 12 SC Points)

- 1. (2 points) Test benches exist for all top-level components and the entire design. The test benches for the entire design can be demonstrated or documented to cover all of the functional requirements given in the design specific success criteria.
- 2. (4 points) Entire design synthesizes completely, without any inferred latches, timing arcs, and, sensitivity list warnings.
- 3. (2 points) Source and mapped version of the complete design behave the same for all test cases. The mapped version simulates without timing errors except at time zero.
- 4. (2 points) A complete IC layout is produced that passes all geometry and connectivity checks.
- 5. (2 points) The entire design complies with targets for area, pin count, throughput (if applicable), and clock rate. The final targets for these parameters will be determined by course staff based on your design review. Failure to reach any of the targets will result a score of 1 out of 2 provided that you are within 50% on area, 10% on pin count, and 25% on throughput. Doing worse in any category will result in a score of 0 out of 2.

#### 4.2.2 Design Specific Success Criteria: (Total of 8 SC Points)

- 1.(2 points) Demonstrate by simulation of a Verilog test bench that window buffer block works.
  - 2.(1 points) Demonstrate by simulation of a Verilog test bench that gaussian block works.
  - 3. (2 points) Demonstrate by simulation of a Verilog test bench that sobel filter works.
- 4. (2 points) Demonstrate by simulation of a Verilog test bench that gradient calculator works.
- 4. (1 points) Demonstrate by simulation of a Verilog test bench that the overall function works.

# **Excerpt of relevant data sheet:**



Figure 1-4 Example multi-layer AHB-Lite block diagram

| Name    | Description                                                                                                        |
|---------|--------------------------------------------------------------------------------------------------------------------|
| HCLK    | The bus clock times all bus transfers. All signal timings are related to the rising edge of HCLK.                  |
| HRESETn | The bus reset signal is active LOW and resets the system and the bus. This is the only active LOW AHB-Lite signal. |

## Global signals

| Name        | Destination       | Description                                                                                                                                                                                                                                             |
|-------------|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| HADDR[31:0] | Slave and decoder | The 32-bit system address bus.                                                                                                                                                                                                                          |
| HBURST[2:0] | Slave             | The burst type indicates if the transfer is a single transfer or forms part of a burst. Fixed length bursts of 4, 8, and 16 beats are supported. The burst can be incrementing or wrapping. Incrementing bursts of undefined length are also supported. |

| HSIZE[2:0]    | Slave | Indicates the size of the transfer, that is typically byte, halfword, or word. The protocol allows for larger transfer sizes up to a maximum of 1024 bits                                                                     |
|---------------|-------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| HTRANS[1:0]   | Slave | Indicates the transfer type of the current transfer. This can be: • IDLE • BUSY • NONSEQUENTIAL • SEQUENTIAL.                                                                                                                 |
| HWDATA[255:0] | Slave | The write data bus transfers data from the master to the slaves during write operations. A minimum data bus width of 32 bits is recommended. However, this can be extended to enable higher bandwidth operation.              |
| HWRITE        | Slave | Indicates the transfer direction. When HIGH this signal indicates a write transfer and when LOW a read transfer. It has the same timing as the address signals, however, it must remain constant throughout a burst transfer. |

Master signals

| Name          | Destination | Description                                                                                                                                                                                                                                                                         |
|---------------|-------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| HRDATA[255:0] | Multiplexor | During read operations, the read data bus transfers data from the selected slave to the multiplexor. The multiplexor then transfers the data to the master. A minimum data bus width of 32 bits is recommended. However, this can be extended to enable higher bandwidth operation. |

| HREADYOUT | Multiplexor | When HIGH, the HREADYOUT signal indicates that a transfer has finished on the bus. This signal can be driven LOW to extend a transfer.                                                                                                                                                     |
|-----------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| HRESP     | Multiplexor | The transfer response, after passing through the multiplexor, provides the master with additional information on the status of a transfer. When LOW, the HRESP signal indicates that the transfer status is OKAY. When HIGH, the HRESP signal indicates that the transfer status is ERROR. |

Slave signals

| Name         | Destination      | Description                                                                                                  |
|--------------|------------------|--------------------------------------------------------------------------------------------------------------|
| HRDATA[31:0] | Master           | Read data bus, selected by the decoder.                                                                      |
| HREADY       | Master and Slave | When HIGH, the HREADY signal indicates to the master and all slaves, that the previous transfer is complete. |
| HRESP        | Master           | Transfer response, selected by the decoder                                                                   |

Multiplexor

| HTRANS[1:0] | ТУРЕ                   |
|-------------|------------------------|
| b00         | IDLE                   |
| b01         | BUSY                   |
| b10         | NONSEQ(non-sequential) |

Transfer types

**Table 3-2 Transfer size encoding** 

| HSIZE[2] | HSIZE[1] | HSIZE[0] | Size<br>(bits) | Description |
|----------|----------|----------|----------------|-------------|
| 0        | 0        | 0        | 8              | Byte        |
| 0        | 0        | 1        | 16             | Halfword    |
| 0        | 1        | 0        | 32             | Word        |
| 0        | 1        | 1        | 64             | Doubleword  |
| 1        | 0        | 0        | 128            | 4-word line |
| 1        | 0        | 1        | 256            | 8-word line |
| 1        | 1        | 0        | 512            | -           |
| 1        | 1        | 1        | 1024           | -           |

Table 3-3 Burst signal encoding

| Туре   | Description                                |  |
|--------|--------------------------------------------|--|
| SINGLE | Single burst                               |  |
| INCR   | Incrementing burst of undefined length     |  |
| WRAP4  | 4-beat wrapping burst                      |  |
| INCR4  | 4-beat incrementing burst                  |  |
| WRAP8  | 8-beat wrapping burst                      |  |
| INCR8  | 8-beat incrementing burst                  |  |
| WRAP16 | 16-beat wrapping burst                     |  |
| INCR16 | 16-beat incrementing burst                 |  |
|        | SINGLE INCR WRAP4 INCR4 WRAP8 INCR8 WRAP16 |  |



Figure 3-9 Four-beat incrementing burst

# **Reference Cited:**

"AMBA® 3 AHB-Lite Protocol," 2006. [Online]. Available: http://mazsola.iit.uni-miskolc.hu/ $\sim$ drdani/docs\_arm/IHI0033A\_AMBA3\_AHB\_Lite.pdf. [Accessed: 24-Feb-2018].